Machine learning-enabled cancer diagnostics with widefield polarimetric second-harmonic generation microscopy

The extracellular matrix (ECM) collagen undergoes major remodeling during tumorigenesis. However, alterations to the ECM are not widely considered in cancer diagnostics, due to mostly uniform appearance of collagen fibers in white light images of hematoxylin and eosin-stained (H&E) tissue sections. Polarimetric second-harmonic generation (P-SHG) microscopy enables label-free visualization and ultrastructural investigation of non-centrosymmetric molecules, which, when combined with texture analysis, provides multiparameter characterization of tissue collagen. This paper demonstrates whole slide imaging of breast tissue microarrays using high-throughput widefield P-SHG microscopy. The resulting P-SHG parameters are used in classification to differentiate tumor from normal tissue, resulting in 94.2% for both accuracy and F1-score, and 6.3% false discovery rate. Subsequently, the trained classifier is employed to predict tumor tissue with 91.3% accuracy, 90.7% F1-score, and 13.8% false omission rate. As such, we show that widefield P-SHG microscopy reveals collagen ultrastructure over large tissue regions and can be utilized as a sensitive biomarker for cancer diagnostics and prognostics studies.

Recently, widefield SHG microscopy was shown to significantly reduce the time required for large-area imaging compared to laser scanning systems 18 . Based on this technology, we have developed a high-throughput quantitative imaging technique by integrating P-SHG with widefield microscopy. Widefield P-SHG microscopy enables rapid scan-less imaging of large sample regions ( ∼ several millimeters) using 16 orthogonal polarization states. Moreover, the subsequent image processing avoids time consuming pixel-by-pixel model fitting, and provides a series of polarimetric parameters that highlight the ultrastructural properties of the collagenous ECM. Furthermore, morphological organization of collagen fibers in the ECM of cancer tissue can be investigated using texture analysis of polarimetric parameters [19][20][21][22] . Texture analysis of P-SHG polarimetric parameters was previously utilized in studies of lung cancer, revealing significant features that aid in characterization of tumor tissue 23 .
In this work, polarimetric and texture parameters from widefield P-SHG imaging were used to train a logistic regression classifier, which was able to differentiate normal from tumor tissue with 94.2% accuracy and F1-score, and 6.3% false discovery rate, in human breast tissue microarray slides. The trained classifier was then further evaluated by predicting the presence of tumor on an independent data set, yielding 91.3% accuracy, 90.7% F1-score, and 13.8% false omission rate. Implementation of machine learning into the image postprocessing enhances differentiation of normal and tumor tissues, potentially enabling automated screening of, for example, tissue microarrays, as well as mapping areas of altered collagen to improve histopathologic diagnoses.

Results
Experimental design and overview. We have developed a widefield P-SHG microscopy technique for rapid high-throughput quantitative imaging, which we applied to breast tissue microarrays. The method generates a set of information-rich polarimetric and texture parameters, and utilizes these to classify normal and tumor tissues. Figure 1 shows an overview of the experimental workflow. The tissue microarray slide was imaged under widefield P-SHG microscopy with 16 unique combinations of input and output polarization states (Fig. 1a). These were combined to generate images of SHG Stokes vector elements, in accordance with reduced double Stokes-Mueller polarimetry, as illustrated in Fig. 1b (see Supplementary Information for more details). Images of the following 5 distinct polarimetric parameters were extracted and used to characterize the ultrastructure of collagen in each image pixel: (1) average SHG intensity from all incident circular polarizations, (2) R-ratio, which is the ratio of 2 achiral second-order nonlinear optical susceptibility elements ( χ (2) zzz /χ (2) zxx , where the z-axis is parallel to the collagen fiber axis), (3) degree of circular polarization (DCP), (4) SHG circular dichroism (SHG-CD), and (5) SHG linear dichroism (SHG-LD). Each polarimetric parameter carries unique ultrastructural information about the ECM collagen.
The SHG intensity obtained with circular incident polarization is highly sensitive to the molecular organization of collagen in the focal volume and is independent of the in-plane orientation of collagen fibers. Polarity of neighboring collagen fibrils, tilt angle of fibrils out of the image plane, crossing of fibrils, and distances between the fibrils influence the SHG intensity 10,24,25 . It has been shown that the SHG intensity is reduced in various solid tumors 26,27 . The R-ratio describes the structural organization of the collagen fibers in the focal volume and it is sensitive to the ultrastructure of the fibers, fiber tilt angles out of the image plane and crossing of the fibers. R-ratio has been successfully used to differentiate normal and malignant tissues in lung, breast, thyroid, and pancreas [11][12][13]16 . The value of DCP is closely related to the R-ratio in non-scattering tissue and reflects the disorder and depolarization in the sample 28,29 . SHG-CD depends on the R-Ratio, out-of-plane fiber tilt angle and polarity, and the phase retardance between the achiral and chiral susceptibility components 10,24,30,31 . SHG-CD has been utilized in investigations of ovarian cancer 25 . Here, we introduce SHG-LD as a new parameter in P-SHG microscopy to study the in-plane organization of collagen fibers in the ECM. Recent investigation show that the collagen fiber orientation computed from SHG-LD is slightly altered due to R-Ratio and complex chiral susceptibilities 24 . A similar definition of SHG-LD was previously used in surface SHG measurements of Langmuir-Blodgett films of chiral polymers functionalized with a nonlinear optical chromophore 32 . In this work, we use SHG intensity of circularly polarized light to assess the presence of aligned collagen fibers in the tissue. R-Ratio is used to determine subtle variations in molecular composition of collagen fibers, and DCP is employed as a measure of depolarization as a result of fiber chirality and out-of-plane orientation. Although complex, SHG-CD and SHG-LD of the ECM are treated as approximations for the out-of-plane and in-plane fiber orientation, assuming the presence of a single collagen chiral handedness throughout.
The images of each computed polarimetric parameter were further subjected to texture and statistical analyses. Texture analysis is a well-known method for characterizing tissue morphology by analyzing the variation in the neighboring pixel values 33 . It is commonly used as a classification tool through its ability to recognize patterns that may be indistinguishable to the human eye [34][35][36] . Fractal-based texture analysis has shown to be useful in microscopic image analysis 37 . In this work, we will focus on grey-level co-occurrence matrix (GLCM-based) texture analysis. Here, each polarimetric image is comprised of over 4 million pixels. Hence, to enable highresolution mapping of texture parameters and statistical investigations, all polarimetric images were divided into 64 sub-images (Fig. 1c). The mean, mean absolute deviation (MAD), and 5 of the most useful textures of the polarimetric parameters, including contrast, correlation, entropy, angular second moment (ASM), and inverse difference moment (IDM) were computed over all sub-images to characterize the collagen in the tissue (Fig. 1d) 19,20 .
The final stage of the analysis involved machine learning-assisted diagnostics, as shown in Fig. 1e. For this, the combination of mean, MAD and texture of 5 polarimetric parameters of normal and tumor tissue were used to train a logistic regression classifier. The trained classifier was used to perform predictions on an independent set of images and map the ultrastructural properties of collagen, leading to differentiation of normal and tumor tissue. www.nature.com/scientificreports/ Whole slide P-SHG imaging of tissue microarray. The collagen content of a breast tissue microarray was imaged using widefield P-SHG microscopy, resulting in images of whole breast tissue cores that were 0.6 mm in diameter, without the need for raster scanning. Figure 2a shows the annotated H&E-stained microarray of normal (N) and tumor (T) cores. The latter were characterized by estrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor 2 (HER2) expression, and 3 of the most common subtypes were considered: ER+/PR+/HER2+ (triple positive or +/ + /+ , also known as HER2 positive), ER+/PR+/ HER2− (double positive or +/ + /− , also known as hormone-receptor positive), and ER−/PR−/HER2− (triple negative or −/ − /−) 38 . The breast tissue microarray slide was comprised of 45 tissue cores from 12 different patients. Overall, 15 normal cores and 20 tumor cores (5 triple negative cores, 11 double negative cores, and 4 triple positive cores) were used for further analysis. The SHG intensity images of all cores are seen in Fig. 2b. Each core image is comprised of 2048×2048 pixels, corresponding to an area of approximately 670 µm×670 µm , as shown in Fig. 2c, which corresponds to core 11, as indicated by yellow border in Fig. 2b. The resolving power of the microscope (approximate pixel size of 0.3 µm×0.3 µm ) is shown by an enlarged rectangular region in Fig. 2d, whose location is highlighted with the red rectangle in Fig. 2c. For better viewing of the SHG images, pixels that exhibited signal-to-noise ratio (SNR) <1 were removed. It is clearly seen that tumor tissue has markedly lower SHG intensity than normal, consistent with previous reports in multiple tumor types [11][12][13]16,39 . This is likely due to collagen degradation in the tumor microenvironment which results in randomly oriented fibrils, so resembling a centrosymmetric arrangement. It is also clear that there are considerably fewer pixels with SHG signal present in the tumor tissues, rendering pixel density as an important classification parameter.
Polarimetric SHG imaging and analysis. Representative polarimetric images of normal, triple negative, double positive, and triple positive groups are shown in Fig. 3. The H&E-stained core images (Fig. 3a) were segmented to reveal stained fibrillar components of the ECM using a published convolutional neural network (CNN) based technique 40 (Fig. 3b). H&E segmented images serve as a visual approximation of the total collagen content of the tissue. The true arrangement of the ordered collagen fibers can be visualized with P-SHG microscopy as presented in Fig. 3c. SHG identifies only the ordered collagen structures in the tissue; hence, highlighting a subset of the segmented collagen image.
As shown in Fig. 3d, the tumor cores have larger R-ratio values than normal, similar to earlier reports of human breast tissues imaged with a scanning P-SHG microscope 13 . In addition to the extracted R-ratio, DCP of the SHG signal is obtained. As shown in Fig. 3e, the measured DCP was on average lower in normal tissue compared to each of the tumor groups. www.nature.com/scientificreports/ SHG-CD values ( Fig. 3f) were calculated using polarimetric images with circular input polarizations, and are directly associated with chiral and achiral optical susceptibilities, as well as the out-of-image-plane tilt angle of collagen fibers 10,31 . Tissue samples were sectioned without any fiber orientation bias, resulting in random direction of collagen fibers. Therefore, SHG-CD mean values did not provide useful information for classification of tumor tissue. However, collagen in normal tissue appears wavier, leading to a broader distribution of the SHG-CD. Thus, MAD of SHG-CD distribution was instead used to differentiate between normal and tumor tissues.
The SHG-LD images were constructed using the linear input polarizations (Fig. 3g), and highlight the inplane collagen fiber orientation. Similar to SHG-CD, the SHG-LD distribution appears wider in normal tissue compared to the tumor, reflecting larger variations of the in-plane collagen fiber orientation in normal tissue. The SHG-LD measurements exhibited diverse distribution forms, due to random sectioning of breast cores. Hence, in order to further analyze the degree of variability of in-plane collagen fiber orientation, the MAD of SHG-LD distributions were computed and compared between normal and tumor groups.
Texture analysis and statistical testing. Statistical multiple comparison tests were performed to evaluate the significance of parameter value differences between tumor and normal cores. Each core image was comprised of over 4 million pixels, however, pixels with SNR < 3 were discarded from analysis to ensure reliability of the statistical testing. Each large-area image of the cores was subdivided into 64 sub-images, each with 256×256 pixels (approximately 84 µm×84 µm in area). Sub-images located in the corner of each breast core P-SHG image, as well as those without viable signal were removed from the analysis. The number of pixels with SNR > 3 in each sub-image was used as an important metric, referred to as the SHG pixel density (PD) to indicate the www.nature.com/scientificreports/ abundance of ordered SHG-producing collagen in the tissue. The PD was found to be on average considerably higher in normal tissue than tumor; therefore, it was included in the set of diagnostic polarimetric and texture parameters. The mean, MAD, and texture features of the polarimetric parameters were used in multiple comparisons testing and machine learning-assisted classification. Given the large number of resulting parameters (36), the significance test results are shown as a circular plot in Fig. 4a. The circular plot is divided into four quadrants, representing normal, triple negative, double positive, and triple positive groups. The mean, MAD, and textures of each polarimetric parameter are highlighted by distinct colors. Most differences between normal and tumor groups were statistically highly significant ( p < 0.001 ), as indicated by solid colored links. Links between tumor groups are mostly dashed lines, which indicate significant differences ( 0.001 < p < 0.05 ). Non-significant differences are omitted from the diagram. It is evident that all three considered molecular subtypes of breast cancer are well-differentiated from normal with comparable levels of accuracy using widefield P-SHG.
For a more detailed presentation of the data, boxplots of subsets of the computed parameters are shown in Fig. 4b. Sub-image means of both R-ratio and DCP showed similar trends of being significantly higher in tumor than normal tissue, with triple positive breast cancer possessing the largest R-ratio and DCP. The R-ratio results are in agreement with a previous investigation on small breast tissue regions 13 . The MAD of both SHG-CD and SHG-LD highlighted larger variation of the collagen fibers orientation in normal tissue, compared to tumor tissue. Between tumor groups, triple positive had the largest variation in fiber orientation, approaching that of the normal group. Three of the parameters (SHG-CD mean, SHG-LD mean, and R-ratio IDM) were not significantly different between normal and any of the tumor groups, as indicated by x's on their corresponding nodes in Fig  4a. As such, these parameters were omitted from classification.

Classification and prediction of normal and tumor breast tissue.
To showcase the efficacy of collagen as a diagnostic biomarker and display the detection capabilities of widefield P-SHG microscopy, the polarimetric parameter statistics (mean and MAD) and texture features were used to train a binary logistic regression classifier. Prior to classifier training, two cores (one from normal group and one from tumor group) were removed to form a prediction dataset, which was used to further investigate the predictive power of the classifier.  www.nature.com/scientificreports/ All tumor cores were combined into a single tumor group and by treating each sub-image as an individual data point, 544 normal and 543 tumor data points (samples), across 33 different polarimetric and texture parameters (predictors) were used for classification. The training was repeated 1000 times with random stratified partitioning of the dataset for fivefold cross-validation, and the standard deviation of performance metrics were used to evaluate the classifier stability 41 .
Using the complete dataset, the classifier differentiated tumor from normal tissue with 94.2% accuracy at 50% posterior probability threshold, as indicated in Fig 5a. The threshold-dependent classification performance metrics (true negative rate, true positive rate, negative predictive value, and positive predictive value) were within 93-95%. The complimentary metrics (false positive rate, false negative rate, false omission rate, and false discovery rate) were within 5-7%. In addition, the F1-score of 94.2% demonstrates the excellent robustness and accuracy of the trained classifier in identifying tumor tissue regions.
Two threshold-independent metrics, namely the area under the curve of the receiver operating characteristic (AUROC) and Brier score, were computed to further examine the classifier performance. An AUROC value of 1 is ideal, while 0.5 indicates a completely random classification. The Brier score ranges from 0 (perfectly accurate) to 1 (completely inaccurate). As shown in Fig. 5a, the trained classifier generated an AUROC of 0.984 and Brier score of 0.045, indicating well-differentiated normal and tumor groups and low misclassification rate.
To determine the importance of polarimetric and texture parameters in classifier performance, 4 subsets of the data were considered (Fig. 5b): (1) SHG intensity and PD, which are the most commonly used parameters, (2) SHG intensity and PD, together with corresponding texture analysis, (3) the 5 polarimetric parameters (including SHG intensity) and PD, and (4) the 5 polarimetric parameters and PD with corresponding texture parameters (complete dataset). Classification using only the SHG intensity resulted in the lowest accuracy,   All groups include pixel density. Differences in accuracy, F1-Score, AUROC, and Brier score between groups are statistically highly significant ( p ≪ 0.001 ), as determined by the Kruskal-Wallis H test and the Dunn-Bonferroni post hoc test. Using all polarimetric and texture parameters improves the classification performance across all metrics. All reported accuracy and F1-score values have < 0.4% standard error, and AUROC and Brier score have standard errors of < 0.2% and < 4.5% , respectively. www.nature.com/scientificreports/ F1-score and AUROC values, and the largest Brier Score. Adding texture analysis provided a modest improvement of ( ∼ 1% ) across all metrics. A more significant improvement was evident from using all 5 polarimetric parameters (including the SHG intensity), with accuracy and F1-scores reaching > 90% . Finally, combining all polarimetric and texture parameters, produced the best overall performance with accuracy and F1-score > 94% , AUROC > 98% , and Brier score of 0.045. The trained classifier was then used on the prediction set of normal and tumor cores that had not been part of the training set. As depicted in Fig. 5a, the prediction highlighted sub-images with normal and tumor properties, resulting in 91.3% accuracy, 0.990 AUROC, and 0.067 Brier score. Normal tissue was better predicted than tumor tissue, with a true negative rate of 98.0% compared to a true positive rate of 84.6%. However, the positive predictive value was 97.8%, resulting in an F1-score of 90.7%.
Discussion. For the first time, we have presented the application of widefield P-SHG microscopy of the collagenous extracellular matrix as a potential histopathological biomarker for cancer diagnostics. The instrument provides rapid high-resolution large-area imaging of tissue samples such as the microarrays shown here, while providing comprehensive ultrastructural information through 33 quantifiable parameters. The lack of moving optics, such as scanning mirrors, in widefield imaging significantly contributes to the robustness of the technology. This enables simple construction and customization of the microscope, that is well suited for user-friendly high-throughput applications.
To assess the effects of high-powered laser exposure on biological tissue during widefield P-SHG microscopy, we have measured both Multi-Photon Excitation Fluorescence (MPF) and SHG, and the details are reported elsewhere 24 . In short, 16 polarization states were measured twice for improved SNR and averaged, sustaining ∼ 8 minutes of laser exposure in the process. This resulted in less than 5% MPF bleaching and an increase in measured SHG intensity to a much lesser extent (less than 1%). The minimal SHG intensity increase is due to bleaching of the eosin dye, which likely reduces the reabsorption of the SHG signal. Despite the high laser powers, the relatively low NA excitation lens illuminates a very large section of the tissue ( ∼ 1 mm in diameter), thus, effectively reducing the energy density at the region of interest. Without visual signs of tissue damage and with negligible signal change, we conclude that widefield P-SHG microscopy can be used for imaging under optimized experimental conditions.
Overall trends in many of the measured parameters showed a clear distinction between tumor and normal tissues. In particular, triple positive tissue had the greatest differences from normal tissue across most parameters. However, this distinction was small compared to the overall difference between normal and all tumor groups, as further illustrated by multiple comparisons tests on the circular plot of Fig. 4a.
Classification enabled an efficient use of all measured parameters, allowing for accurate predictions of normal and tumor tissue with performance comparable to other methods involving machine learning-assisted cancer diagnostics [42][43][44] . The pixel density and average SHG intensity were amongst the most informative measured parameters, with substantial differences in the ECM between normal and tumor tissues. It is important to note that simply the absence of SHG signal and, thus, lower pixel density and average SHG intensity are not always indicative of the presence of tumor; sparsely distributed collagen fibers and adipose tissue often do not result in sufficient SHG signal, therefore, such regions must be discarded to avoid inflation of false positive rates. Here, empty regions in the corners of the breast core images, and cores which were mainly comprised of adipose tissue were manually discarded. However, in future developments, H&E images may be used simultaneously with widefield P-SHG to identify and discard such areas automatically. Moreover, most breast cores showed a significant reduction in ordered collagen, as indicated by lower SHG intensity compared to normal tissue. However, 2 cores showed a high degree of stroma, and due to rarity of such tissue types in the microarray, they were discarded from classification training data.
The calculated classification metrics showed robust and accurate tissue differentiation with 94.2% accuracy, 6.3% false discovery rate, AUROC of 0.98, and Brier score of 0.045. Capabilities of the trained classifier were further demonstrated in predicting an independent test dataset with an accuracy of 91.3%, false omission rate of 13.8%, AUROC of 0.990, and a Brier score of 0.067. It is important to note that the number and size of the subimages used, affect the predictive power of the classifier. For example, subdividing the images into fewer but larger sub-images would improve the accuracy and specificity, at the cost of decreased classification stability, resulting in reduced reliability. We found that subdivision level of 64 sub-images per image delivers highly accurate classification with sufficient robustness. We present a detailed investigation of the optimum number of sub-images for classification of the P-SHG data in the supplementary information document (see Supplementary Fig. 2).
A natural extension to this work is introduction of chiral nonlinear susceptibility components. Consequently, additional polarimetric parameters such as chiral second order nonlinear optical susceptibility ratio ( C = χ (2) xyz /χ (2) zxx ) may be introduced to further improve molecular identification, based on varied polarity of the collagen fibers throughout the tissue sample. The C-ratio has been previously investigated in scanning P-SHG systems, revealing information on collagen fiber polarity and organization 10,48 .
In this study, we demonstrated widefield P-SHG microscopy as a potential tool in cancer diagnostics. However, it is important to also identify the applications of the microscope in studying other pathologies related to remodeling of the collagen structure in the ECM and in other collagenous tissues. As an example, widefield P-SHG microscopy may be used to identify large-scale ultrastructural changes in pathologies relating to abnormalities in fibrillar collagen types I, II, and III such as, arterial aneurysms, chondrodysplasias, osteogenesis imperfecta, osteoporosis, osteoarthrosis, intervertebral disc disease, and Ehlers-Danlos syndrome 45 . Scanning P-SHG has also been used to study muscle ultrastructure in rat and drosophila larvae 46,47 . Future applications of the widefield P-SHG may include high-throughput investigations of human muscle pathologies such as muscular dystrophy and multiple sclerosis. In addition, continuous and uniform illumination of the entire imaged area provided www.nature.com/scientificreports/ by the widefield P-SHG microscope, enables dynamic imaging of processes involving fast kinetics, such as live ultrastructural imaging of contracting muscle fibers 18 and deformation of collagen fibers during application of external forces 51 .

Methods
Widefield P-SHG microscopy. A custom microscope was designed and constructed for widefield polarimetric SHG imaging as shown in Fig. 6. A high-power amplified laser (PH1-15W, Light Conversion) was used for large area illumination. The laser power was set to 1.3 W, at 100 kHz repetition rate, and 13 µJ pulse energy. The laser beam was collimated to 4 mm diameter and coupled to the microscope. The beam passed through an infrared polarizing cube (PBS102, Thorlabs) to produce a linearly polarized state, followed by an infrared liquid crystal variable retarder (LCC1223-B, Thorlabs), referred to as the polarization state generator (PSG-LCVR), which was positioned with its fast axis at 45 • to the incident linear polarization. An achromatic doublet (AC254-030-AB, Thorlabs) with a focal length of 30 mm was used to focus the beam on the sample. Widefield imaging was achieved by placing the sample above the focal plane and adjusting the illumination area ( ∼ 1 mm in diameter) through axial translation of the excitation achromatic doublet. A Plan-Neofluar 20×/0.50 air collection objective (420350-9900-000, Zeiss) was used to collect the SHG signal radiated in the forward direction. The polarization state analyzer (PSA-LCVR), comprised of a LCVR in the visible wavelength range (LCC1223-A, Thorlabs), was used to probe the outgoing polarization of the SHG signal (also positioned with its fast axis at 45 • to the incident linear polarization). The SHG signal was passed through a tube lens (452960-0000-000, Zeiss) and a visible range polarizing beam splitter (PBS251, Thorlabs). A dichroic mirror (FF662-FDi02-t3-25x36, Semrock) was used to separate the visible SHG signal from infrared fundamental light. The infrared light trans- www.nature.com/scientificreports/ mitted through the dichroic mirror was blocked by a beam dump. The reflected SHG signal was filtered by two BG40 colored glass filters (FGB37-A, Thorlabs), a 515/10 nm interference filter (65-153, Edmund Optics), and projected onto a sCMOS camera (ORCA-Flash 4.0 V2, Hamamatsu).
To carry out polarimetric measurements, four orthogonal incoming polarization states were generated including left circular polarized (LCP), horizontally linearly polarized (HLP), right circularly-polarized (RCP), and vertically linearly polarized (VLP), corresponding to quarter-wave ( /4), half-wave ( /2), three-quarter-wave (3 /4), and full-wave ( ) retardances of the PSG-LCVR, respectively. For each incoming polarization state, four SHG signal polarization states (set by the PSA-LCVR) corresponding to the same set of retardance values (quarterwave, half-wave, three-quarter-wave and full-wave) were measured, resulting in 16 combinations of polarization states. Each polarization state was imaged with a 10 s exposure time, and 5 s of delay time was used for switching polarization states between exposure times, so that polarimetric measurement of each 670 µm×670 µm imaged area was completed in 4 min. Therefore, P-SHG imaging of the entire microarray was achieved in approximately 3 h. It is important to mention that whole-microarray imaging time may be significantly reduced by decreasing the polarization switch delay time to <1 s, in which case, whole-array P-SHG measurement would be complete in <2.5 h.
Polarimetric parameter calculations. The polarization state of the SHG signal can be characterized by a Stokes vector 49,50 where s 0 is the total intensity, s 1 , s 2 are the linearly polarized Stokes vector components, and s 3 is the circularlypolarized Stokes vector component. The measured SHG of the 16 polarization state combinations were used to compute SHG Stokes vector elements. Using these elements, polarimetric parameters of interest were calculated to reveal ultrastructural properties of interest, including: the average SHG intensity produced with circularlypolarized incident light; a ratio of two achiral second order nonlinear optical susceptibility tensor elements, χ (2) zzz /χ (2) zxx (R-ratio), where z points along the fiber axis; the degree of circular polarization (DCP); SHG circular dichroism (SHG-CD); and SHG linear dichroism (SHG-LD). R-ratio and DCP have been shown previously to provide valuable information on the ultrastructural and molecular organization of collagen fibers, while SHG-CD and SHG-LD probe the three-dimensional fiber orientation [11][12][13]16,48,51,52 . For detailed calculation of all the Stokes vector elements, refer to supplementary information document.
The orientation-independent SHG, I CP , of each core was computed from the average SHG signal from right and left (RCP and LCP) circular polarizations of the incident light where the superscripts refer to the PSG polarization states, and the subscripts identify Stokes vector components of SHG signal. The R-ratio was calculated in terms of 1st and 4th elements of the measured SHG Stokes vector 55 : DCP was defined as the average of the magnitude of circular Stokes element, over the total intensity, such that 56 In addition, we computed SHG-CD, which was conventionally expressed as [53][54][55] and SHG-LD as: where both SHG-CD and SHG-LD range from − 2 to 2.
Texture analysis. Textural parameters were extracted from a grey-level co-occurrence matrix (GLCM), which is a second-order statistical representation of the grey-level distribution in a region of interest 33 . The GLCM was built by counting the occurrence of a pixel of grey level i followed by a pixel of grey level j at a distance d along a direction specified by angle θ . In the case of nearest neighbors ( d = 1 ), the four angles of analysis were θ = {0 • , 45 • , 90 • , 135 • } . GLCMs of all four angles were constructed and averaged, forming a directionindependent GLCM. The probability density function, P d,θ (i, j) , of finding certain grey-level pairs was obtained through normalization of the resulting GLCM. From this, we calculated the five textural parameters of interest, as described by Haralick et.al 33 .  www.nature.com/scientificreports/ A custom-built texture analysis program was written in MATLAB, taking advantage of the available functions. The size of the GLCM depends on the number of grey levels ( N g ) specified by the user. Since the program runtime was highly dependent on N g , a testing routine was carried out to determine the optimal N g , represented by discretization level of continuous polarimetric parameters (refer to supplementary information and Supplementary Fig. 1 for detailed calculation of N g ).
The presence of significant background in widefield P-SHG imaging results in highly skewed texture parameter distributions, leading to reduced differentiation between the groups. As such, the background pixels were converted to "not a number" (NaN) and were not included in the analysis. Consequently, the texture parameters only reflect the structure of the signal-producing entities, such as collagen fibers.
The measured texture parameters were contrast, correlation, entropy, angular second moment (ASM) and inverse difference moment (IDM). The contrast, ν , is defined by: which is a measure of pixel value differences between grey-level pairs. Contrast is highly sensitive to variation in neighboring pixel values and it can be used to probe collagen fiber density in the tissue.
Correlation, ρ , quantifies a linear dependence between grey-level pairs, is expressed as: where µ is the mean and σ is the standard deviation of the grey levels. The correlation ranges from − 1 to 1 for perfectly negatively or perfectly positively correlated images, respectively. In the context of widefield polarimetric SHG microscopy, correlation may be used to showcase well-defined structures and patterns in the acquired images. Entropy, S, is defined by: which measures the level of disorder or lack of spatial organization of the grey levels. Entropy increases as disorder increases; however, it is also indicative of the size of the pixel clusters, which often represent bundles of collagen fibers. Conversely, angular second moment (ASM) is a measure of orderliness and ranges from 0 to 1, where 1 is characteristic of a completely uniform image. ASM also varies with the size of pixel clusters of comparable values and is a measure of the average size of similar collagen fiber bundles in an image. ASM is expressed as: Finally, the inverse difference moment (IDM) describes the homogeneity of an image. Similar to ASM, the range of IDM is from 0 to 1, where 1 is indicates of a completely uniform image. IDM is provided by: Statistical analysis and machine learning. Each polarimetric image was divided into 64 sub-images for high-resolution texture analysis, and to compute the statistics of polarimetric parameters. The performance and stability of the logistic regression classifier in differentiating tumor from normal tissue were considered to establish the optimal discretization level of 64 sub-images for each polarimetric image (see supplementary information and Supplementary Fig. 2 for more detail). The results indicated a range of optimal discretization levels from 4 to 64 sub-images per image. The latter was chosen, since it results in a large AUROC and Brier scores, while performing with the highest stability.
Sub-images without SHG signal were discarded and the mean and mean absolute deviation (MAD) of all remaining sub-images were computed using standard MATLAB toolboxes. To reduce the effects of extreme outliers, data points below the 1st, and above the 99th percentiles of parameters in each group were discarded. These statistics were then used in further multiple comparisons tests to establish statistical significance between various groups. There were 544 normal, 103 triple negative, 325 double positive, and 115 triple positive data points across 36 polarimetric and texture parameters. Most parameters did not form normal distributions as indicated by Q-Q plots, and the Shapiro-Wilk test. Hence, MATLAB's Kruskal-Wallis H test with 1086 degrees of freedom, along with the Dunn-Bonferroni post hoc test, were used to determine the significance of the difference between normal and all three tumor groups, for all polarimetric and texture parameters, separately. Differences with p-values < 0.05 and < 0.001 were considered to be significant and highly significant, respectively. (i − j) 2 P d,θ (i, j), www.nature.com/scientificreports/ Breast tissue microarray. All human tissues used in this study were obtained with written informed patient consent. The study was approved by the Research Ethics Board of University Health Network, Toronto, Canada. The study was performed in accordance with relevant guidelines and regulations and in compliance with the tenets of the Declaration of Helsinki. The breast tissue microarray slide contained 45 tissue cores from 12 different patients, collected using a 0.6 mm diameter biopsy needle and stabilized in phosphate-buffered formalin. The sample was embedded in paraffin and stored at room temperature. The cores were cut to 5 µm thickness, stained with H&E and imaged with a bright-field microscope scanner (Aperio Whole Slide Scanner, Leica Biosystems) for reference. The microarray contained normal tissue, as well as tissue from 3 different tumor subtypes, including ER−/PR−/HER2− (triple negative or −/ − /− ), ER+/PR+/HER2− (double positive or +/ + /− , also known as hormone-receptor positive), and ER+/PR+/HER2+ (triple positive or +/ + /+ , also known as HER2 positive), where ER, PR, and HER2 indicate estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2, respectively. Cores that were mainly comprised of adipose tissue did not produce significant SHG signal and were removed from the analysis. In addition, two tumor cores containing significant collagenous stroma with minor tumor foci were excluded, due to their rarity in the microarray. In total, 15 normal cores, as well as 5 triple negative cores, 11 double negative cores, and 4 triple positive cores (20 tumor cores in total) were considered. The tissue assessment and tumor identification were performed by expert pathologists (S.J.D. and E.Z.).

Data availability
All data generated and analysed during this study are included in this published article and its supplementary information files. Raw image files are available from V.B. upon request.